{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Advanced nodules sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In previous [tutorial](/1_Running_preprocessing.ipynb) you explored load/dump, preprocessing and sampling crops with nodules via sample_nodules method. Sample_nodules has share argument, which shows ratio of crops with nodules vs crops without nodules for balancing positive/negative crops in batch for better network training. However, crops without nodules are made randomly, which is not what you may want, as some locations are more likely to have nodules then others. For this purpose, it is possible to use any histogram (e.g. histogram of nodules locations) for sampling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examples in this notebook use [LUNA16 competition dataset](https://luna16.grand-challenge.org/) in MetaImage (mhd/raw) format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import glob\n",
    "import shutil\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from ipywidgets import interact\n",
    "from copy import deepcopy\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sys.path.append('..')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from radio.batchflow import FilesIndex, Dataset, Pipeline\n",
    "from radio import CTImagesMaskedBatch as CTIMB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build histogram of nodules' positions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load a dataset with ct-scans (LUNA16), See previous [tutorial](/1_Running_preprocessing.ipynb) for clarification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "DIR_LUNA = '/notebooks/data/MRT/luna/s*/*.mhd'\n",
    "lunaix = FilesIndex(path=DIR_LUNA, no_ext=True)\n",
    "lunaset = Dataset(index=lunaix, batch_class=CTIMB)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataset has 888 CT-scans"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "888"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(lunaset.indices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load annotation file provided by LUNA16\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nodules = pd.read_csv('/notebooks/data/MRT/luna/CSVFILES/annotations.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a toy histogram with random uniform sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ranges = list(zip([0]*3, (32, 64, 64)))\n",
    "histo = list(np.histogram(np.random.uniform(low=0, high=1, size=(100, 3)) *\n",
    "                            np.array(SHAPE).reshape(1, -1), range=ranges, bins=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "histo[0].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use histogram to sample nodules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pipe = (Pipeline()\n",
    "        .load(fmt='raw', components='images')\n",
    "        .fetch_nodules_info(nodules)\n",
    "        .unify_spacing(shape=(384, 448, 448), spacing=(1.7, 1.0, 1.0))\n",
    "        .create_mask()\n",
    "        .sample_nodules(batch_size=10, nodule_size=(32, 64, 64), share=0.5,\n",
    "                        histo=histo, variance=(20, 70, 70))\n",
    "       )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We may use pipeline as generator, let's specify batch_size = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "gen = (lunaset >> pipe).gen_batch(batch_size=5, n_epochs=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nods = next(gen)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Get only (and all) cancerous nodules"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While sampling nodules, it's possible to access only cancerous/non-cancerous nodules in batch, let's create a subset of 5 ct-scans, load it and run preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bch = CTIMB(lunaix.create_subset(lunaix.indices[[100, 110, 120, 130, 140]]))\n",
    "\n",
    "bch = bch.load(fmt='raw', components='images')\n",
    "\n",
    "bch = bch.fetch_nodules_info(nodules)\n",
    "bch = bch.create_mask()\n",
    "bch = bch.unify_spacing(shape=(384, 448, 448), spacing=[1.7, 0.9, 0.9])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, you can use ```sample_nodules``` with ```share=1``` and ```batch_size=None```. Then all your batch would consist of crops with all nodules that are marked in annotation for these patients's scans."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "crop_bch = bch.sample_nodules(nodule_size=(32, 64, 64), batch_size=None, share=1,\n",
    "                                 variance=(49, 196, 196))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crop_bch.num_nodules == len(crop_bch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, you can set any ```share```, say 0.6 and ```batch_size=None```. Then your batch would consist of crops with all nodules that are marked in annotation for these patients's scans AND some additional random crops without nodules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "crop_bch = bch.sample_nodules(nodule_size=(32, 64, 64), batch_size=None, share=0.6,\n",
    "                                 variance=(49, 196, 196))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print('number of nodules:'crop_bch.num_nodules,', total number of crops in batch:',len(crop_bch))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can easily find crops with nodules using batch's index method, let's find third nodule crop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nodnum = 2\n",
    "\n",
    "nodix = crop_bch.indices[nodnum]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3bdbcb3f017f4ba4a577e5beb9e4783a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "A Jupyter Widget"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<function __main__.<lambda>>"
      ]
     },
     "execution_count": 148,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "interact(lambda height: plot_arr_slices(height, \n",
    "                                        only_cancer.get(nodix, 'images'),\n",
    "                                        only_cancer.get(nodix, 'masks'),\n",
    "                                        only_cancer.get(nodix, 'masks')),\n",
    "         height=(0.01, 0.99, 0.01))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
